positive item sampling and fix infinite loop #7
base: main
isipalma commented Jun 16, 2021
- Put a limit on the sampling loop to prevent an infinite loop
- Change the sampling method so that the positive item is always in the profile
I have some questions about the code, and I caught a possible bug.
"# Mark interactions used for evaluation procedure if needed\n",
"if \"evaluation\" not in interactions_df:\n",
" print(\"\\nApply evaluation split...\")\n",
" interactions_df = mark_evaluation_rows(interactions_df)\n",
" # Check if new column exists and has boolean dtype\n",
" assert interactions_df[\"evaluation\"].dtype.name == \"bool\"\n",
" print(f\">> Interactions: {interactions_df.shape}\")\n",
"\n",
I forgot, why was this needed here?
I just noticed, the code was not present in this repository but it was in mine, right?
Correct.
@@ -202,22 +210,27 @@
"metadata": {},
"outputs": [],
"source": [
"def random_triplet_sampling(samples_per_user, hashes_container, desc=None):\n",
"def random_triplet_sampling(samples_per_user, hashes_container, desc=None, limit_iteration=10000):\n",
Why 10000? Maybe we could use the number of interactions as the limit, or a proportion of that number. If I have a million records and need to sample a significant share of them, a proportion of `len(interactions_df)` (or `interactions_df.size`, not sure which one is better) would be more appropriate than a fixed number.
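To make the suggestion above concrete, here is a minimal sketch of a limit that scales with the dataset instead of being hard-coded. The helper name `iteration_limit`, its `proportion` and `minimum` parameters, and their default values are all hypothetical and not part of this PR. On the `len` versus `size` question: assuming `interactions_df` is a pandas DataFrame, `len(interactions_df)` is the number of rows, while `interactions_df.size` is rows times columns, so `len` is probably the closer match to "number of interactions".

```python
def iteration_limit(n_interactions, proportion=0.5, minimum=1_000):
    """Return a retry budget proportional to the number of interactions.

    Hypothetical helper: the defaults are placeholders, not tuned values.
    """
    if n_interactions < 0:
        raise ValueError("n_interactions must be non-negative")
    if not 0 < proportion <= 1:
        raise ValueError("proportion must be in (0, 1]")
    # Never go below a small floor, so tiny datasets still get some retries.
    return max(minimum, int(n_interactions * proportion))


# Example: a million interactions gives a budget of half a million attempts.
limit = iteration_limit(1_000_000)
```

The result could then be passed as `limit_iteration=iteration_limit(len(interactions_df))` when calling `random_triplet_sampling`, assuming the signature shown in the diff.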
" aux_limit = limit_iteration\n", | ||
" while n > 0:\n", | ||
" if aux_limit == 0:\n", | ||
" break\n", |
`aux_limit` never changes its value. In line 247 we should use `aux_limit` instead of `limit_iteration`; that may fix it.
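A minimal sketch of the pattern this comment is asking for, where the local counter (not the parameter) is decremented on every attempt so the `break` can actually trigger. This is not the notebook's code: `sample_with_limit`, `candidates`, and the duplicate check are hypothetical stand-ins for the real triplet sampling logic.

```python
import random


def sample_with_limit(candidates, n, limit_iteration=10000, rng=None):
    """Draw up to n distinct items, giving up after limit_iteration attempts.

    Hypothetical stand-in for the sampling loop discussed in the review.
    """
    rng = rng or random.Random()
    seen = set()
    samples = []
    aux_limit = limit_iteration
    while n > 0:
        if aux_limit == 0:
            break  # Budget exhausted: stop instead of looping forever.
        aux_limit -= 1  # Decrement the local counter, not limit_iteration.
        candidate = rng.choice(candidates)
        if candidate in seen:
            continue  # Rejected attempt still consumes budget.
        seen.add(candidate)
        samples.append(candidate)
        n -= 1
    return samples


# Asking for 3 distinct items from a single candidate can never succeed;
# the limit makes the call return with what it found.
result = sample_with_limit([1], 3, limit_iteration=20)
```

Because the loop can now stop early, the function may return fewer samples than requested, so callers that need an exact count have to check the length of the result.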
"assert len(samples_training) >= TOTAL_SAMPLES_TRAIN\n", | ||
"assert len(samples_testing) >= TOTAL_SAMPLES_VALID\n", | ||
"\n", |
Why was this removed?